HPar: A Practical Parallel Parser for HTML – Taming HTML Complexities for Parallel Parsing
Authors
Abstract
Parallelizing HTML parsing is challenging because of the complexity of HTML documents and the inherent dependences in the HTML parsing algorithm. As a result, despite numerous studies of parallel parsing, HTML parsing remains sequential today. It is one of the final barriers to fully parallelizing browser operations and thereby minimizing the browser’s response time, an important factor in user experience, especially on portable devices. This paper provides a comprehensive analysis of the special complexities of parallel HTML parsing and presents a systematic exploration of how to overcome those difficulties through specially designed speculative parallelization. This work develops, to the best of our knowledge, the first pipelining and data-level parallel HTML parsers. The data-level parallel parser, named HPar, achieves up to 2.4x speedup on quad-core devices. This work demonstrates the feasibility of efficient parallel HTML parsing for the first time and offers a set of novel insights for parallel HTML parsing.
Similar resources
Parallel Parsing: The Earley and Packrat Algorithms
Parsing plays a critical role in our modern computer infrastructure: scripting languages such as Python and JavaScript, layout languages such as HTML, CSS, and Postscript/PDF, and data exchange languages such as XML and JSON are all interpreted, and so require parsing. Moreover, by some estimates, the time spent parsing while producing a rendered page from HTML, CSS, and JavaScript is as much a...
Static Validation of Dynamically Generated HTML Documents Based on Abstract Parsing and Semantic Processing
Abstract parsing is a static-analysis technique for a program that, given a reference LR(k) context-free grammar, statically checks whether or not every dynamically generated string output by the program conforms to the grammar. The technique operates by applying an LR(k) parser for the reference language to data-flow equations extracted from the program, immediately parsing all the possible st...
PAPAGENO: A Parallel Parser Generator for Operator Precedence Grammars
In almost all language processing applications, languages are parsed employing classical algorithms (such as the LR(1) parsers generated by Bison), which are sequential due to their left-to-right state-dependent nature. Although early theoretical studies on parallel parsing algorithms delineated potential speedups on abstract parallel machines using a data-parallel approach, practical developme...
Language-Independent Text Parsing of Arbitrary HTML-Documents. Towards A Foundation For Web Genre Identification
This article describes an approach to parsing and processing arbitrary web pages in order to detect macrostructural objects such as headlines, explicitly and implicitly marked lists, and text blocks of different types. The text parser analyses a document by means of several processing stages and inserts the analysis results directly into the DOM tree in the form of XML elements and attributes, s...
XSS-FP: Browser Fingerprinting using HTML Parser Quirks
There are many scenarios in which inferring the type of a client browser is desirable, for instance to fight against session stealing. This is known as browser fingerprinting. This paper presents and evaluates a novel fingerprinting technique to determine the exact nature (browser type and version, e.g., Firefox 15) of a web browser, exploiting HTML parser quirks exercised through XSS. Our experim...